# Libraries
library(ggplot2)
library(dplyr)
# Load the Titanic dataset
titanic <- read.csv("https://raw.githubusercontent.com/GeorgeOrfanos/Data-Sets
/refs/heads/main/titanic.csv")
19 Data Visualization with ggplot2
19.1 Introduction
In this chapter, we introduce data visualizations in R using
ggplot2, the most widely used package for graphical
representation in R. This package is part of the
tidyverse collection, making it highly compatible
with tools like dplyr for data manipulation. To keep
things simple and focused, we’ll use the
Titanic dataset for all examples. This dataset is
available from several public sources, including GitHub and Kaggle.
The logic behind ggplot2 is modular: we begin with a
basic plot and then add layers of various elements to specify
additional components, such as points, lines, labels, or colors.
Just as we use the pipe operator (%>%) in
dplyr to build data workflows step by step,
ggplot2 follows a
layered grammar of graphics. This approach allows
us to define clearly and systematically what should appear in a
plot, making our code flexible, readable and auditable.
19.2 Preparing the Environment
Let’s begin by loading the necessary R packages,
dplyr and ggplot2, and importing the
Titanic dataset:
The Titanic dataset contains information on 891 passengers aboard the RMS Titanic. This version is a cleaned subset of the original passenger manifest and is commonly used for educational and modeling purposes.
Since we’ll analyze this dataset in the next chapter as well, here we will only briefly describe the three variables we’ll focus on:
-
Fare: The fare paid for the ticket
-
Age: The age of the passenger in years
-
Survived: Indicates whether the passenger survived (
1) or did not survive (0)
To simplify our plots, we will create a subset of the dataset that
includes only these three variables, while transforming the variable
Survived to a factor:
# Subset the dataset to keep only the three relevant columns
titanic_subset <- titanic %>%
select(Fare, Age, Survived) %>%
mutate(Survived = as.factor(Survived))
19.3 Starting from Scratch
The plot below shows the relationship between Age and
Fare:
There is no strong association between these two variables—older
passengers could have either cheaper or more expensive tickets
compared to younger passengers. In statistical terms, there is no
clear correlation between Age and Fare.
Let’s now walk through on how to re-create this plot from scratch
using ggplot2.
Whenever we begin a new plot, we use the
ggplot() function. The first argument we need to
specify is the data we want to plot using the
data argument. For example, to start a plot based on
the titanic dataset:
# Start a plot
ggplot(data = titanic_subset)
This generates an empty plot. Although it might seem like something is missing, this is actually expected; even though we have specified the dataset, we have not yet told R what we would like to plot (or how to plot it).
Next, we define the aesthetic mappings using the
aes() function, which stands for aesthetics. In
ggplot2, aesthetic mappings link variables in the
dataset to visual properties like position (x, y), color, size, etc.
To set Age on the x-axis and Fare on the
y-axis, we use:
# Specifying aesthetics
titanic_subset %>%
ggplot(mapping = aes(x = Age, y = Fare))
Now the plot shows axes labeled with the correct variable names, but
still no data points, because we still have not yet defined how to
represent the data visually. To do that, we add a
geometry, which tells ggplot2 how to
display the data. For a scatter plot (where each point represents
one observation), we use geom_point():
# Add geometry to create a scatter plot
titanic_subset %>%
ggplot(mapping = aes(x = Age, y = Fare)) +
geom_point()
Notice that in ggplot2, we add layers to the plot using
the plus sign (+), not the pipe (%>%)
used in dplyr. The result is a functional scatter plot.
To refine its appearance, we can customize the
theme, which controls non-data elements like
background color, grid lines, font sizes, and margins.
ggplot2 offers several built-in themes. In our original
example, we used theme_light():
# Create a graph with pipes - Data, Aesthetic, Geometry, Theme
titanic_subset %>%
ggplot(mapping = aes(x = Age, y = Fare)) +
geom_point() +
theme_light()
The final plot looks exactly like the one we presented at the
beginning of this section. This example demonstrates the modular
nature of ggplot2: each layer builds on the previous
one to produce the final visualization.
To summarize, every plot in ggplot2 consists of four
main components: data, aesthetics, geometry, and theme. A fifth,
optional component is facets, which allow us to split one plot into
multiple subplots based on a variable. We will now explore each of
these layers in more detail, including how facets work.
19.4 Data and Aesthetics
The first thing we need to consider before creating a plot is the
data we want to visualize. For instance, the Age and
Fare variables are numeric (continuous), so it makes
sense to visualize them with a scatter plot. A bar plot, on the
other hand, typically requires a categorical variable along with a
continuous one — so we wouldn’t be able to create a meaningful bar
plot using only Age and Fare, since the
values of one variable would be treated as different categories.
At this stage, we are only choosing which data and variables we want to use. We haven’t yet specified whether we want a scatter plot or a bar plot, or even which variable should go on the x-axis. When we talk about data, we’re simply referring to the data frame that will be used for plotting. In earlier examples, when we created the blank canvas, that was the step where the data was specified.
Once the dataset is chosen, we decide on the aesthetics,
meaning which variables will be mapped to visual elements of the
plot. Aesthetics include everything we write inside the
aes() function. In the previous example, we included
Age and Fare in aes().
Although we didn’t see any shapes on the plot at that point, we
could already see the x and y axes labeled, meaning those aesthetics
have been mapped.
Besides the x and y axes, we can map more variables to additional
aesthetics while still keeping the plot two-dimensional. For
instance, we can add the Survived variable to color the
points in the scatter plot:
# Scatter plot with color mapped to a variable
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
color = Survived)) +
geom_point() +
theme_light()
Each point now has a color based on the corresponding survival status. Similarly, we can map variables to other aesthetics:
-
fill: fills an area with color (used in bars, boxplots, etc.) -
shape: changes the shape of the point -
alpha: adjusts transparency -
size: adjusts the size of points or lines
Depending on the data type, these mappings may or may not make sense, so it’s worth experimenting with different combinations to explore their effects.
If instead we want to assign fixed values to these aesthetics (not
based on a variable), we do this outside of the
aes() function, inside the corresponding
geom_*() layer. For example, here we set all points to
blue, manually:
# Scatter plot with color set to a fixed value
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare)) +
geom_point(color = "blue") +
theme_light()
It might seem confusing at first, but the rule is simple:
-
If the aesthetic is mapped to a variable → put it inside
aes() -
If the aesthetic is set to a fixed value → put it outside
aes()
It is important to note that we can specify aesthetics either in the
ggplot() function or inside the
geom_point() function. When aesthetics are defined in
ggplot(), they are inherited by all subsequent layers
(geoms); we will see later in this chapter how to add multiple
geoms. In contrast, when aesthetics are specified inside a specific
geom_ function, they apply only to that particular
layer.
There are additional aesthetics we can adjust to control the
appearance of the plot. For example, we can limit the x-axis and
y-axis ranges using the xlim() and
ylim() functions respectively:
# Limit the range of the x-axis and the y-axis
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
color = Survived)) +
geom_point() +
theme_light() +
xlim(0, 120) +
ylim(0, 600)
This sets the x-axis range from 0 to 120 and the y-axis range from 0 to 600. Any points outside these limits will be omitted, and R will issue a warning when the code is executed.
We can also change how values are scaled using functions that start
with scale_*(). For example, to show the x-axis in
log-10 scale:
# Apply log10 transformation to the x-axis
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
color = Survived)) +
geom_point() +
theme_light() +
scale_y_log10()
We often want to add titles and axis labels to make our plots easier
to understand. This is done using the labs() function:
# Add custom axis labels and title
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
color = factor(Survived))) +
geom_point() +
theme_light() +
labs(
x = "Passenger Age",
y = "Ticket Fare",
color = "Survived",
title = "Ticket Fare by Age and Survival Status")
Here, labs() adds labels for the x-axis, y-axis,
legend, and title. Although not technically part of the core
aesthetics, labs() is conceptually linked to the
aesthetic layer because it describes how aesthetics are
communicated.
When a variable is mapped to an aesthetic, we can use the
scale_*_manual() family of functions to manually
control how values are displayed. Here’s how to change the colors
assigned to survival status:
# Manually set colors for the 'Survived' variable
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
color = factor(Survived))) +
geom_point() +
theme_light() +
scale_color_manual(values = c("0" = "red", "1" = "green")) +
labs(color = "Survival Status")
We can also manually set shapes, fill colors, transparency, and sizes using similar functions:
-
scale_fill_manual() -
scale_shape_manual() -
scale_alpha_manual() -
scale_size_manual()
For example, setting custom shapes:
# Manually set shapes for the 'Survived' variable
titanic_subset %>%
ggplot(aes(x = Age,
y = Fare,
shape = factor(Survived))) +
geom_point() +
theme_light() +
scale_shape_manual(values = c("0" = 1, "1" = 16)) +
labs(shape = "Survival Status")
These tools give us full control over how your aesthetics appear in the plot, allowing you to adapt your graph for clarity, emphasis, or even publication styling.
There are still more aesthetics available beyond those mentioned
above, depending on the type of plot and driving the visual effect
one aims for. The full list of available aesthetics can be found on
the
ggplot2 official documentation site
or by using the ?aes help command in RStudio.
19.5 Geometry
So far, we have specified the data and the aesthetics we want to use in a graph, but we have not yet specified what kind of plot we want to create. Should it be a scatter plot, a histogram, or a bar chart?
In ggplot2, we define the type of plot using a function
that starts with geom_*(). For example, earlier we used
geom_point() to create a scatter plot. We could have
used geom_line() to create a line plot instead.
(Although it might not yield any useful insights in our Titanic
data!) The geometry function determines the form of our
plot.
There are many types of geometries available in
ggplot2. In this section, we will focus on some of the
most common ones, namely:
Scatter plots
Histograms and density plots
Bar plots
Box plots
Line plots
19.5.1 Scatter Plots
We already created a scatter plot at the beginning of this chapter
using the geom_point() function. A scatter plot is
ideal when we want to display two numeric variables on the axes.
We can add more variables to a scatter plot by changing an element
of the plotted points. For example, we previously used the
variable Survived to color the points - we could
similarly change shape.
19.5.2 Histogram and Density Plots
In the Statistical Distributions chapter, we introduced histograms and density plots as tools to visualize the distribution of a variable. Now let’s see how to create them.
We use geom_histogram() and
geom_density() to generate histogram and density
plots respectively. These are one-dimensional plots, so we only
need to specify the x variable inside the
aes() function:
# Histogram of Age
titanic_subset %>%
ggplot(aes(x = Age)) +
geom_histogram() +
theme_light()
The distribution of Age looks roughly normal. We can change the
granularity of the histogram by using the (number of)bins
or binwidth arguments in
geom_histogram(). We only need to set one of the two,
as the other one will be set automatically:
# Histogram with 5 bins
titanic_subset %>%
ggplot(aes(x = Age)) +
geom_histogram(bins = 5) +
theme_light()
# Histogram with bin width of 10
titanic_subset %>%
ggplot(aes(x = Age)) +
geom_histogram(binwidth = 10) +
theme_light()
We can also explore the distribution of Fare, using a
density plot.
# Density Plot of Fare
titanic_subset %>%
ggplot(aes(x = Fare)) +
geom_density() +
theme_light()
The variable Fare seems to follow a log-normal
distribution, with a long right tail.
We can also combine these two types of geometries into one plot —
but in that case, we need to manually specify the y-axis. A
histogram shows counts on the y-axis, while a density plot shows
densities. To combine them, we use both
geom_histogram() and geom_density(), and
set the y aesthetic inside aes() to the
somewhat unusual expression ..density... This special
notation tells ggplot2 to scale the histogram so that
it represents a density rather than raw counts, allowing it to
align correctly with the density curve. We also modify the
appearance to make the plot clearer and demonstrate how different
aesthetics can be combined:
# Combined Histogram and Density Plot
titanic_subset %>%
ggplot(aes(x = Age, y = ..density..)) +
geom_histogram(fill = "grey") +
geom_density(size = 2, color = "blue") +
theme_light()
19.5.3 Bar plots
Bar plots are an excellent choice for visualizing categorical
data. To create a bar plot, we use the
geom_bar() function. As with histograms and density
plots, we need to include only one variable on the x-axis. Let’s
plot the Survived variable:
# Bar plot
titanic_subset %>%
ggplot(aes(x = Survived)) +
geom_bar() +
theme_light()
Similar to geom_histogram(), the y-axis here
represents the number of observations. In fact, a bar plot can be
thought of as a histogram for categorical variables, where each
“bin” corresponds to a specific category. This plot shows that
most passengers on the Titanic did not survive.
Instead of plotting counts on the y-axis, we may sometimes want to
show a summary statistic. For instance, the average ticket price
(Fare) within each category. In this case, the x-axis
still displays the categories, but the y-axis will show a
numerical value. To do this, we use the
group_by() and summarize() functions
from the dplyr package to transform the data before
plotting. While this is not strictly part of ggplot2,
it’s an important reminder that data often needs to be manipulated
into the right shape before plotting.
Let’s first compute the average ticket price for each category of
Survived:
# Average fare per category
average_fare <- titanic_subset %>%
group_by(Survived) %>%
summarize(Average_Fare = mean(Fare))
# Print the results
average_fare
# A tibble: 2 × 2
Survived Average_Fare
<fct> <dbl>
1 0 22.1
2 1 48.4
Now that we have a summary table, we can fill in both
x and y inside aes(), just
like we would for a scatter plot with two numeric variables.
However, geom_bar() won’t work when we explicitly
supply a y variable and so we use the
geom_col() function instead:
# Plot of Average Fare per Category
average_fare %>%
ggplot(aes(x = Survived, y = Average_Fare)) +
geom_col() +
theme_light()
The average ticket price among survivors was higher than among non-survivors. This is an interesting insight as it may suggest that passengers who paid more had a higher priority when boarding the life boats or were located in more favorable areas of the ship (closer to the life boats perhaps?) before it sank.
19.5.4 Box plots
To visualize the distribution of a numeric variable, we previously used histograms and density plots. Another way to do this is with a box plot. As the name suggests, a box plot is essentially a… plot that includes a box, which represents the values close to the center of the distribution.
A box plot typically displays five summary statistics known as the five-number summary. These include:
-
Minimum: The smallest value
-
First Quartile (Q1): The 25th percentile, marking the lower edge of the box
-
Median (Q2): The 50th percentile, shown by the line inside the box
-
Third Quartile (Q3): The 75th percentile, marking the upper edge of the box
-
Maximum: The largest value
Let’s create a box plot for the Age variable using
geom_boxplot():
# Box plot
titanic_subset %>%
ggplot(aes(x = Age)) +
geom_boxplot() +
theme_light()
The black horizontal line represents the median. The box contains
all values between the first and third quartiles, while the black
dots outside the box are outliers. This plot shows that the
Age variable is fairly symmetric, with just a few
outliers on the right-hand side.
While histograms or density plots are preferred when visualizing
the distribution of a single variable, box plots are excellent for
comparing distributions across categories. To do this, we place
the numeric variable on the y-axis and the categorical variable on
the x-axis. For instance, the following plot shows the
distribution of Age across survival categories:
# Box plot
titanic_subset %>%
ggplot(aes(x = Survived, y = Age)) +
geom_boxplot() +
theme_light()
The centers of the two distributions are nearly at the same level, with the distribution of non-survivors being slightly higher. This might reflect the fact that older passengers were slightly less likely to survive the disaster.
19.5.5 Line Plots
Line plots are another useful type of plot, although we haven’t encountered them in previous chapters. As the name suggests, a line plot connects data points with a line, helping to visualize trends or changes over time or ordered values.
To see how this works, let’s manually create a simple dataset with just a few data points:
# Create a simple data set
simple_data_set <- tibble(
x = c(10, 8, 13, 9, 11, 14, 6, 4, 12),
y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84))
Since we have two numeric variables, we can first create a scatter
plot using the geom_point() function:
# Scatter plot
simple_data_set %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
theme_light()
For a line plot, we use the geom_line() function
instead of geom_point(). This connects the data
points with a line:
# Line plot
simple_data_set %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
theme_light()
In the line plot above, the data points are not shown—only the
line appears. To include both the individual points
and the connecting line, we simply add the
geom_point() function along with
geom_line():
# Scatter-Line plot
simple_data_set %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
geom_point() +
theme_light()
This combined plot is often used to show both the trend (line) and the individual values (points), especially when the number of points is small and we want to see both clearly.
19.6 Facets
As we mentioned earlier, facets can be considered a separate, fifth plot element. Facets allow us to break down a plot into multiple subplots based on the values of at least one categorical variable. While we could always create separate plots manually for each category, facets provide a convenient way to generate multiple plots at once and display them side by side, making comparisons easier and more intuitive.
In ggplot2, there are two main functions for creating
facets: facet_grid() and facet_wrap(). The
key difference between them lies in how they organize the resulting
plots.
-
facet_grid()arranges plots in a grid format. One categorical variable is mapped to the rows and another to the columns. This is ideal for visualizing interactions between two categorical variables. -
facet_wrap(), in contrast, uses a single categorical variable and wraps the plots into a series, typically laid out in multiple rows or columns. This is especially useful when there’s only one categorical variable involved.
To understand how these two functions work, let’s revisit the
scatter plot we created earlier in the chapter. This time, we add
facet_grid() to break the plot into two separate
scatter plots—one for each category of the
Survived variable:
# Scatter plots with grid (rows)
titanic_subset %>%
ggplot(aes(x = Age, y = Fare)) +
geom_point() +
theme_light() +
facet_grid(Survived ~ .)
With both facet_grid() and facet_wrap(),
we use the tilde (~) symbol to define the layout of the
plots. Since Survived appears on the left-hand side
here, the plots are arranged in rows. If we want the plots arranged
in columns, we simply place Survived on the right-hand
side:
# Scatter plots with grid (columns)
titanic_subset %>%
ggplot(aes(x = Age, y = Fare)) +
geom_point() +
theme_light() +
facet_wrap(. ~ Survived)
Using facet_wrap() instead yields a similar result when
dealing with only one categorical variable:
# Scatter plots with wrap
titanic_subset %>%
ggplot(aes(x = Age, y = Fare)) +
geom_point() +
theme_light() +
facet_grid(. ~ Survived)
In this case, because Survived has only two levels,
both functions produce nearly identical visual outputs.
To better understand the differences between these two functions,
let’s now introduce a second categorical variable. We’ll create an
Age_Category variable that classifies passengers as
“Underage” (under 18), “Adult” (18 or older), or “Missing” (if the
age is not available):
# Create Age_Category
titanic_subset <- titanic_subset %>%
mutate(
Age_Category = case_when(
is.na(Age) ~ "Missing",
Age >= 18 ~ "Adult",
TRUE ~ "Underage"))
Now, let’s use facet_grid() to create a grid of density
plots for Fare, with Age_Category on the
rows and Survived on the columns:
# Density plots with grids
titanic_subset %>%
ggplot(aes(x = Fare)) +
geom_density() +
theme_light() +
facet_grid(Age_Category ~ Survived)
This results in six separate density plots. The rows represent the age categories, and the columns represent the survival categories.
If we use facet_wrap() instead, the layout changes:
# Density plots with wraps
titanic_subset %>%
ggplot(aes(x = Fare)) +
geom_density() +
theme_light() +
facet_wrap(Age_Category ~ Survived)
We still get the same six plots, but now each one is displayed as
part of a flexible layout with a combined label like
Age_Category = Underage, Survived = 1 at the top of
each panel. This makes facet_wrap() especially useful
when the combinations of variables don’t form a clean grid.
19.7 Theme
The ggplot2 package provides a default visual theme, as
we’ve seen in all our plots so far. However, we can change this
default theme by using one of the many alternatives included in the
package, or even by customizing a theme to match our own
preferences. In fact, we have already been using the
theme_light() function in our previous graphs to apply
a different look than the default one.
Playing with different themes is largely a matter of personal taste,
but the built-in themes in ggplot2 are generally high
quality and widely used. Choosing a theme helps ensure clarity and
consistency in the way your plots are presented.
Because nearly every visual element in a plot can be customized, we’ll focus on the basic intuition behind theme customization.
Let’s say we want the tick labels on the x-axis to appear larger. If
we’re using RStudio, we can type "axi" inside the
theme() function to explore available options via
autocomplete. Doing this reveals that the correct argument is
axis.text.x.
This tells ggplot2 that we want to change the text
labels on the x-axis. To specify how we want to change them, such as
adjusting the size or color, we use the
element_text() function. For example, here’s how we
would increase the text size for the x-axis in our original scatter
plot:
# Change the size of text on the x-axis
titanic_subset %>%
ggplot(aes(x = Age, y = Fare)) +
geom_point() +
theme_light() +
theme(axis.text.x = element_text(size = 15))
As a result, the x-axis labels now appear larger. We can clearly see this when comparing them with the unchanged y-axis labels. We can also modify other attributes, such as color. In the example below, the x-axis labels are now both larger and blue:
# Change the size and color of text on the x-axis
titanic_subset %>%
ggplot(aes(x = Age, y = Fare)) +
geom_point() +
theme_light() +
theme(axis.text.x = element_text(size = 15, color = "blue"))
This example gives just a glimpse into the types of layout changes
you can make using theme(). It’s highly recommended to
experiment with these options and adjust your plots based on the
needs of your audience or the context of your analysis.
For more details, you can visit the official ggplot2 website to
explore theme documentation. Additionally, RStudio provides publicly
available
ggplot2 cheat sheets, which are a helpful reference when customizing your plots.
19.8 Recap
In this chapter, we explored key types of plots in
ggplot2, including bar plots, box plots, line plots,
and the use of facets to create multiple subplots based on
categorical variables. We also discussed how to customize the
appearance of plots through themes and how to control aesthetics
either globally within the ggplot() function or locally
within individual geoms. These tools form the foundation for
effective and flexible data visualization in R.